Hashing, Graphs, Trees, Search Trees, Indexing and Multiway Trees, File Organization

Introduction

The Table abstract data type

Implementations of the table data structure

Hash Tables

Bucket Array

Hash Function

Hash Code

Compression Functions

Collision-Handling Schemes

Collision likelihoods and load factors for hash tables

Hash Table Implementation

A simple Hash Table in operation

Strategies for dealing with collisions

Linear Probing

Double Hashing

Complexity of hash tables

Let's break down the concepts of hash tables, collision resolution, and the two main approaches: separate chaining and open addressing. We'll also delve into load factors, rehashing, and the different open-addressing strategies.


Understanding Hash Tables and Collision Resolution


Hash tables are a fundamental data structure used to implement associative arrays or maps. The idea is to use a hash function to map keys to indices in an array; the array's cells are called buckets. The goal is to achieve constant-time complexity for basic operations like find, put, and erase.


However, collisions occur when two distinct keys produce the same hash value. This challenge led to the development of collision resolution techniques.


Separate Chaining

Separate chaining is an efficient way to handle collisions. Each bucket in the array contains a small map (or list) that chains together entries with the same hash value. This approach delegates operations to the mini-maps at each bucket, making basic operations straightforward.


For example, if we want to find a key, we simply look in the mini-map at the hashed index. If a collision occurs, we iterate through the linked list until we find the desired key.
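A minimal sketch of a separate-chaining table, using plain Python lists as the per-bucket "mini-maps" (the class and method names below are illustrative, not from the text):

```python
class ChainedHashMap:
    """Hash table with separate chaining: each bucket holds a small
    list of (key, value) pairs that share the same compressed index."""

    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        # Hash the key, then compress into a valid bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new entry chained onto the bucket

    def find(self, key):
        # Look only in the mini-map at the hashed index.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None
```

Colliding keys simply end up in the same list, and `find` walks that one short list rather than the whole table.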


Load Factor and Expected Running Time


Load factor (λ), the ratio of the number of entries to the capacity of the hash table, is crucial. For open addressing, maintaining λ < 0.5 is recommended, while for separate chaining, keeping λ < 0.9 is advised.

The expected running time of operations in a hash table is influenced by the load factor. With a good hash function, the expected time complexity is O(⌈n/N⌉). If n is kept low relative to N, operations run in near constant time.
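As a small illustration (the helper name is ours, not from the text):

```python
def load_factor(n, N):
    """Load factor λ = n / N: number of entries divided by capacity."""
    return n / N

# 45 entries in 100 buckets: λ = 0.45, acceptable for either scheme.
# 60 entries in 100 buckets: λ = 0.6, still fine for separate
# chaining (< 0.9) but over the 0.5 threshold for open addressing,
# so an open-addressing table would rehash at this point.
```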


Open Addressing: Linear Probing, Quadratic Probing, and Double Hashing


Open addressing avoids auxiliary structures like linked lists for collision resolution but requires dealing with collisions differently. There are several open-addressing schemes, each with its advantages and disadvantages.


Linear Probing


Linear probing is a collision resolution technique used in hash tables, specifically in open-addressing schemes. When a collision occurs (i.e., two keys produce the same hash value), linear probing attempts to find the next available slot in the hash table to place the collided entry.

Here's how linear probing works:

1. Hashing the Key: When inserting a key-value pair into the hash table, you first hash the key to determine its index in the array.

2. Collision Detection: If the calculated index is already occupied by another entry (collision occurs), linear probing comes into play.

3. Linear Search for Available Slot: Linear probing involves searching for the next available (empty) slot in the hash table, starting from the collision index and moving linearly through the array until an empty slot is found.

4. Insertion: Once an empty slot is found, the collided entry is inserted into that slot.

Here's a simple example to illustrate linear probing:

Suppose we have a hash table with 10 slots (0 to 9) and the following keys hash to the same index (collision): key1, key2, and key3.


1. Hashing the Keys:

  • key1 hashes to index 4.
  • key2 hashes to index 4 (collision with key1).
  • key3 also hashes to index 4 (collision with key1 and key2).

2. Collision Detection:

  • Since index 4 is already occupied by key1, a collision is detected.

3. Linear Search for Available Slot:

  • Linear probing starts from index 4 and moves linearly through the array to find the next empty slot.
  • It checks index 5, then index 6, and so on, until it finds an empty slot (say index 7, because indices 5 and 6 are already occupied by other entries).

4. Insertion:

  • key2 is inserted into index 7, resolving the collision.
  • If subsequent collisions occur, the same process is repeated until an empty slot is found.


Linear probing is relatively straightforward to implement and requires minimal additional memory overhead. However, it can lead to clustering, where consecutive slots become occupied, potentially slowing down search operations. Additionally, deletion in linear probing can be more complex, often involving marking deleted slots rather than removing them entirely. Despite these drawbacks, linear probing remains a popular collision resolution technique due to its simplicity and efficiency in certain scenarios.
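The four steps above can be sketched as follows (the helper name and the injectable `hash_fn` parameter are ours, for illustration):

```python
def linear_probe_insert(table, key, value, hash_fn=hash):
    """Insert into a fixed-size open-addressing table using linear probing.

    `table` is a list whose slots hold (key, value) pairs or None when empty.
    Returns the index actually used; raises ValueError if the table is full.
    """
    n = len(table)
    start = hash_fn(key) % n                # step 1: hash the key
    for offset in range(n):
        i = (start + offset) % n            # step 3: move linearly, wrapping
        if table[i] is None or table[i][0] == key:
            table[i] = (key, value)         # step 4: insert at the free slot
            return i
    raise ValueError("hash table is full")  # every slot was occupied
```

Replaying the worked example with a hash function that always returns 4: key1 lands in slot 4, and key2 and key3 probe forward to the next free slots.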




    Quadratic Probing


    Quadratic probing uses a quadratic function to find the next available slot. It avoids the clustering seen in linear probing but introduces secondary clustering.

    Quadratic probing is a technique used in open addressing to resolve collisions in hash tables. When a collision occurs (i.e., two keys hash to the same index), quadratic probing iteratively searches for the next available slot by using a quadratic function. The quadratic function is applied to the original hash value until an empty slot is found.


    Here's a step-by-step explanation of how quadratic probing works:


    1. Collision Occurs:

    Two keys hash to the same index, causing a collision.


    2. Quadratic Probing:

    A quadratic function of the form f(j) = j² is used to calculate the next position to check. The idea is to explore positions with increasing quadratic offsets.


    3. Probe Sequence:

    The probing sequence is given by (h(k) + f(0)) mod N, (h(k) + f(1)) mod N, (h(k) + f(2)) mod N, …, where h(k) is the original hash value, f(j) = j² is the quadratic function, and N is the size of the hash table.


    4. Searching for an Empty Slot:

    Starting from the hashed index h(k), quadratic probing searches for the next available slot in the probing sequence. If the slot is occupied, it continues to the next position based on the quadratic function.


    5. Insertion or Retrieval:

    Once an empty slot is found, the key-value pair can be inserted into the hash table during an insertion operation. During a retrieval operation, the algorithm continues probing until it either finds the key or encounters an empty slot.


    6. Avoiding Clustering:

    Quadratic probing helps avoid primary clustering, a common issue in linear probing where consecutive occupied slots form clusters. The quadratic function spreads probes out across the table, reducing this kind of clustering.


    7. Completing the Search:

    The process continues until an empty slot is found or the entire probing sequence is explored. If the entire sequence is searched and no empty slot is found, the algorithm may need to resize the hash table and rehash the elements.


    8. Deletion:

    Deleting an element in quadratic probing can be more complex than insertion or retrieval. When deleting an element, a special marker may be used to indicate a deleted slot, allowing the algorithm to skip over it during future searches.


    While quadratic probing helps reduce primary clustering, it introduces a different type of clustering known as secondary clustering: keys with the same initial hash value follow exactly the same probe sequence. Despite this, quadratic probing is a commonly used method in open addressing schemes, offering a good compromise between simplicity and efficiency. Careful selection of the hash function and choosing a prime table size can optimize its performance.
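The probe sequence described above can be generated directly (a sketch; the function name is ours):

```python
def quadratic_probe_sequence(h, N, limit):
    """Probe indices (h + j*j) mod N for j = 0, 1, ..., limit-1,
    i.e. the quadratic-probing sequence with f(j) = j squared."""
    return [(h + j * j) % N for j in range(limit)]
```

For h(k) = 4 and N = 10 the first probes are 4, 5, 8, 3, 0: the offsets grow quadratically rather than advancing one slot at a time as in linear probing.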




    Double Hashing


    Double hashing employs a secondary hash function to find the next slot. It doesn't suffer from clustering but requires careful choice of the secondary hash function.

    Double hashing is an open-addressing collision resolution technique used in hash tables. Unlike separate chaining, where collisions are resolved by storing multiple items in the same bucket, double hashing aims to find alternative locations within the hash table to place items that collide.


    How Double Hashing Works


    1. Primary Hashing: Like any hash table, double hashing begins with a primary hash function. This function determines the initial position where an item should be stored based on its key.

    2. Secondary Hashing: If a collision occurs at the primary hash location, double hashing employs a secondary hash function. This function generates an offset, which determines how far away from the primary position the algorithm should search for an available slot.

    3. Probe Sequence: The algorithm follows a sequence of probe steps, determined by the secondary hash function, until it finds an empty slot or exhausts the search.

    4. Insertion or Search: When inserting an item, the algorithm places it in the first available empty slot along the probe sequence. During search operations, it traverses the same probe sequence to locate the desired item.


    Example


    Let's say we have a hash table with 10 buckets and the following primary and secondary hash functions:


  • Primary hash function: h1(k) = k mod 10
  • Secondary hash function: h2(k) = 7 - (k mod 7)

    Now, suppose we want to insert an item with key k = 23 into the hash table:

    1. Primary Hashing: h1(23) = 23 mod 10 = 3.
    The primary hash function indicates bucket 3 as the initial position.

    2. Collision: Bucket 3 is already occupied. Now, we apply double hashing.

    3. Secondary Hashing: h2(23) = 7 - (23 mod 7) = 7 - 2 = 5.
    The secondary hash function calculates an offset of 5.

    4. Probe Sequence: Probing advances in steps of h2(23) = 5, so the first probe after bucket 3 is bucket (3 + 5) mod 10 = 8.
    If bucket 8 were also occupied, the next probe would be (3 + 2 × 5) mod 10 = 3 again: with a table size of 10 and a step of 5, the probe sequence only ever visits buckets 3 and 8. This is one reason the table size is usually chosen to be a prime number, so that every step size is coprime to N and the sequence can reach every bucket.

    5. Insertion: Suppose bucket 8 is available. We insert the item with key 23 into bucket 8.
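The probe sequence for this example can be computed directly (a sketch; the function name is ours, and we generalize the example's primary hash to k mod N). It also shows why a prime table size matters: with N = 10 and a step of h2(23) = 5 the sequence cycles between just two buckets, while with the prime N = 11 it reaches every bucket.

```python
def double_hash_probes(key, N, limit):
    """First `limit` probe indices using the example's hash functions:
    primary h1(k) = k mod N and secondary h2(k) = 7 - (k mod 7)."""
    h1 = key % N         # primary hash: the initial bucket
    h2 = 7 - (key % 7)   # secondary hash: the step between probes
    return [(h1 + j * h2) % N for j in range(limit)]
```

For key 23 and N = 10 this gives 3, 8, 3, 8, …; for key 23 and N = 11 the eleven probes visit all eleven buckets.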


    Advantages and Considerations


    Avoids Clustering: Double hashing tends to distribute items more evenly across the hash table compared to linear or quadratic probing, reducing the risk of clustering.

    No Auxiliary Data Structures: Unlike separate chaining, double hashing doesn't require additional linked lists or arrays to handle collisions, saving memory overhead.

    Choice of Hash Functions: Selecting suitable primary and secondary hash functions is critical for the efficiency of double hashing. They should produce sufficiently unique offsets to prevent clustering and ensure a wide dispersion of items across the table.

    Load Factor Consideration: As with other open-addressing schemes, maintaining a low load factor is essential to prevent excessive probing and ensure efficient operations.

    Double hashing provides an alternative approach to collision resolution in hash tables, offering a balance between simplicity and efficiency when compared to other open-addressing techniques.




    Load Factors and Rehashing


    Maintaining a low load factor is essential for open addressing to prevent clusters and ensure efficiency. Rehashing, the process of resizing and reinserting elements, is performed when the load factor exceeds a threshold.

    Rehashing into a new, larger table allows for better distribution of elements and ensures constant-time operations. If the table size doubles during rehashing, the amortized cost is acceptable.
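A minimal sketch of rehashing for an open-addressing table with linear probing (the doubling policy follows the text; the helper name and table representation are illustrative):

```python
def rehash(old_table, hash_fn=hash):
    """Resize to double the capacity and reinsert every entry.

    `old_table` is a list of (key, value) pairs or None for empty slots,
    as in a linear-probing open-addressing table.
    """
    new_table = [None] * (2 * len(old_table))
    for entry in old_table:
        if entry is None:
            continue                         # skip empty slots
        key, value = entry
        i = hash_fn(key) % len(new_table)    # recompute index for new size
        while new_table[i] is not None:      # linear probing in the new table
            i = (i + 1) % len(new_table)
        new_table[i] = (key, value)
    return new_table
```

Every key must be rehashed because its compressed index depends on the table size; doubling keeps the amortized cost of the occasional rebuild constant per insertion.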


    Conclusion


    Hash tables are a powerful tool for implementing maps, providing efficient search, insertion, and deletion operations. Collision resolution methods, load factor management, and rehashing strategies are crucial elements in designing a reliable and performant hash table.

    By understanding these concepts, developers can make informed decisions when choosing between separate chaining and open addressing based on their specific requirements and constraints.

    Compression Function


    A compression function is a crucial component of a hash function. Its role is to transform a potentially large range of hash codes into a smaller, manageable range suitable for indexing into a bucket array. This ensures that the hash codes generated for keys can be mapped to valid indices within the bucket array, preventing errors such as negative indices or indices exceeding the array's capacity. The compression function helps maintain the efficiency and integrity of hash table operations by mapping keys to appropriate locations in the bucket array.
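One standard compression scheme is the MAD (multiply-add-divide) method, which maps a hash code i to ((a·i + b) mod p) mod N for a prime p > N and random a, b. A sketch (the factory name and the particular prime below are illustrative):

```python
import random

def make_mad_compression(N, p=109345121):
    """Build a MAD compression function mapping hash codes into [0, N).

    p is assumed to be a prime larger than N; a is drawn from [1, p-1]
    and b from [0, p-1], fixed for the lifetime of the table.
    """
    a = random.randrange(1, p)
    b = random.randrange(0, p)

    def compress(hash_code):
        # Outer mod N guarantees a valid, non-negative bucket index.
        return ((a * hash_code + b) % p) % N

    return compress
```

Because the final reduction is mod N, the result is always a valid bucket index, even for very large or negative hash codes, which is exactly the error-prevention role described above.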